NSF PAR Search | NSF Public Access Repository

TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches

Shah, Aashaka; Chidambaram, Vijay; Cowan, Meghan; Maleki, Saeed; Musuvathi, Madan; Mytkowicz, Todd; Nelson, Jacob; Saarikivi, Olli; Singh, Rachee (April 2023, USENIX)

Machine learning models are increasingly being trained across multiple GPUs and servers. In this setting, data is transferred between GPUs using communication collectives such as ALLTOALL and ALLREDUCE, which can become a significant bottleneck in training large models. Thus, it is important to use efficient algorithms for collective communication. We develop TACCL, a tool that enables algorithm designers to guide a synthesizer into automatically generating algorithms for a given hardware configuration and communication collective. TACCL uses a novel communication sketch abstraction to get crucial information from the designer to significantly reduce the search space and guide the synthesizer towards better algorithms. TACCL also uses a novel encoding of the problem that allows it to scale beyond single-node topologies. We use TACCL to synthesize algorithms for three collectives and two hardware topologies: DGX-2 and NDv2. We demonstrate that the algorithms synthesized by TACCL outperform the Nvidia Collective Communication Library (NCCL) by up to 6.7x. We also show that TACCL can speed up end-to-end training of Transformer-XL and BERT models by 11%–2.3x for different batch sizes.
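For readers unfamiliar with the two collectives named in the abstract, the following is a minimal single-process sketch of what ALLREDUCE and ALLTOALL compute. It shows only the end result each rank should hold, not the routing or scheduling that TACCL synthesizes, and it does not use NCCL or any TACCL API. The function names `all_reduce` and `all_to_all` and the list-of-lists representation of per-rank buffers are assumptions made for illustration.

```python
from typing import List


def all_reduce(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate ALLREDUCE (sum): every rank ends with the element-wise sum.

    buffers[r] is the input buffer held by rank r (hypothetical layout).
    """
    total = [sum(column) for column in zip(*buffers)]
    # Each rank receives its own copy of the reduced buffer.
    return [list(total) for _ in buffers]


def all_to_all(buffers: List[List[int]]) -> List[List[int]]:
    """Simulate ALLTOALL: rank src sends its dst-th chunk to rank dst.

    With one chunk per destination, the result is a transpose of the input.
    """
    n = len(buffers)
    return [[buffers[src][dst] for src in range(n)] for dst in range(n)]


if __name__ == "__main__":
    print(all_reduce([[1, 2], [3, 4], [5, 6]]))
    print(all_to_all([[0, 1], [2, 3]]))
```

A collective algorithm of the kind the abstract describes decides which links carry which chunks and in what order; the functions above specify only the final state that any such algorithm must reach.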
Breaking the computation and communication abstraction barrier in distributed machine learning workloads

https://doi.org/10.1145/3503222.3507778

Jangda, Abhinav; Huang, Jun; Liu, Guodong; Sabet, Amir Hossein; Maleki, Saeed; Miao, Youshan; Musuvathi, Madanlal; Mytkowicz, Todd; Saarikivi, Olli (February 2022, ASPLOS 2022: Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems)

Recent trends towards large machine learning models require both training and inference tasks to be distributed. Considering the huge cost of training these models, it is imperative to unlock optimizations in computation and communication to obtain the best performance. However, the current logical separation between computation and communication kernels in machine learning frameworks misses optimization opportunities across this barrier. Breaking this abstraction can provide many optimizations to improve the performance of distributed workloads. However, manually applying these optimizations requires modifying the underlying computation and communication libraries for each scenario, which is both time-consuming and error-prone. Therefore, we present CoCoNet, which contains (i) a domain-specific language to express a distributed machine learning program in the form of computation and communication operations, (ii) a set of semantics-preserving transformations to optimize the program, and (iii) a compiler to generate jointly optimized communication and computation GPU kernels. Providing both computation and communication as first-class constructs allows users to work on a high-level abstraction and apply powerful optimizations, such as fusion or overlapping of communication and computation. CoCoNet enabled us to optimize data-, model- and pipeline-parallel workloads in large language models with only a few lines of code. Our experiments show that CoCoNet significantly outperforms state-of-the-art distributed machine learning implementations.
Niijima: sound and automated computation consolidation for efficient multilingual data-parallel pipelines

https://doi.org/10.1145/3341301.3359649

Xu, Guoqing Harry; Veanes, Margus; Barnett, Michael; Musuvathi, Madan; Mytkowicz, Todd; Zorn, Ben; He, Huan; Lin, Haibo (October 2019, Proceedings of the 27th ACM Symposium on Operating Systems Principles)
